
This data set contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offer during certain weeks.
Not all users receive the same offer, and that is the challenge to solve with this data set.
Every offer has a validity period before the offer expires. As an example, a BOGO offer might be valid for only 5 days. You'll see in the data set that informational offers have a validity period even though these ads are merely providing information about a product; for example, if an informational offer has 7 days of validity, you can assume the customer is feeling the influence of the offer for 7 days after receiving the advertisement.
To give an example, a user could receive a "spend 10 dollars, get 2 dollars off" discount offer on Monday. The offer is valid for 10 days from receipt. If the customer accumulates at least 10 dollars in purchases during the validity period, the customer completes the offer.
However, there are a few things to watch out for in this data set. Customers do not opt into the offers that they receive; in other words, a user can receive an offer, never actually view the offer, and still complete the offer. For example, a user might receive the "spend 10 dollars, get 2 dollars off" offer but never open it during the 10-day validity period. If the customer spends 15 dollars during those ten days, there will be an offer-completion record in the data set; however, the customer was not influenced by the offer, because the customer never viewed it.
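To make this pitfall concrete, the influence rule can be sketched as a small predicate. The helper and the event times below are hypothetical (they are not part of the data set); times are in hours, and 0 stands for "event never happened", mirroring the convention used in the cleaning steps later on.

```python
def offer_influenced(received_time, viewed_time, completed_time):
    # Influenced = viewed after receiving AND completed after viewing.
    # A time of 0 means the event never happened (assumption for this sketch).
    return viewed_time > 0 and viewed_time > received_time and completed_time > viewed_time

print(offer_influenced(received_time=12, viewed_time=0, completed_time=120))   # False: completed, never viewed
print(offer_influenced(received_time=12, viewed_time=18, completed_time=120))  # True: viewed, then completed
```

A completion record alone is therefore not evidence of influence; the view event must precede the completion.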
Processes and methodologies act as the skeleton framework on which successful projects are built.
The cross-industry standard process for data mining (CRISP-DM) methodology is an open standard process model that describes common approaches used by data mining experts. In this project, we utilize the CRISP-DM methodology.
CRISP-DM breaks down into six phases.
<img src="CRISP-DM_Process_Diagram.png" title = "CRISP-DM" width= 400 height = 400 alt="By Kenneth Jensen - Own work based on: ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/18.0/en/ModelerCRISPDM.pdf (Figure 1), CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=24930610"/>
Focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition and a preliminary plan.
For the current scenario, we are going to:
The Data Understanding phase is practiced in one of two ways:
1. Start with an initial data collection and proceed with activities to become familiar with the data, discover first insights, or detect interesting subsets to form hypotheses about hidden information.
2. Recognize specific interesting questions and then collect data related to those questions.
The transition from the Business Understanding phase to the Data Understanding phase is not linear; instead, it is cyclic.
In the current project, we are going to utilize only the data provided by Starbucks, as part of the challenge is working within the inherent limitations of the data. Thereby, we are practicing the first method.
The data is contained in three files:
Here is the schema and explanation of each variable in the files:
portfolio.json
profile.json
transcript.json
import pandas as pd
import numpy as np
import math
import json
!pip install joblib
from joblib import dump, load
from sklearn.pipeline import Pipeline, make_union
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import make_scorer,r2_score, mean_squared_error, f1_score, classification_report, accuracy_score
from sklearn.ensemble import AdaBoostClassifier, AdaBoostRegressor, RandomForestClassifier, RandomForestRegressor, \
GradientBoostingClassifier, GradientBoostingRegressor, ExtraTreesClassifier, ExtraTreesRegressor
from sklearn.multioutput import MultiOutputClassifier
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
import gc
import warnings
warnings.filterwarnings('ignore')
# read in the json files
portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)
profile = pd.read_json('data/profile.json', orient='records', lines=True)
transcript = pd.read_json('data/transcript.json', orient='records', lines=True)
The data preparation phase covers all activities to construct the final dataset from the initial raw data. Data preparation is commonly estimated to consume as much as 80% of the effort in a project.
Data wrangling is the core activity in this phase. There is no single way to perform data wrangling; as a rule of thumb, we will approach it in two steps:
Data Wrangling is part of Data Understanding and Data Preparation phases of the CRISP-DM model and is the first programming step.
Data Wrangling is language and framework independent, and there is no one right way. In our case, we are using Python as the programming language of choice and Pandas as the data manipulation framework.
I am going to divide Data Wrangling into three steps:
Data Wrangling is a cyclic process, and often we need to revisit the steps again and again.
We will perform the Data Wrangling on all three of the data sources provided by Starbucks.
portfolio.info()
portfolio.isnull().any()
portfolio.head(10)
From visual and programmatic assessment, the Portfolio data set has only ten rows and no missing data.
However, the data is not in a machine-learning-friendly structure. We are going to apply one-hot encoding to the channels and offer_type columns.
1. Create dummies for the offer_type column
Code
portfolio_for_ml = pd.get_dummies(portfolio,columns=['offer_type'])
portfolio_for_ml.rename(columns={'offer_type_bogo':'bogo',
'offer_type_discount':'discount',
'offer_type_informational':'informational'},
inplace=True)
Test
portfolio_for_ml.head(10)
2. Split the channels column into email, mobile, social and web columns
Code
def channels_email(data):
    if 'email' in data:
        return 1
    else:
        return 0

def channels_mobile(data):
    if 'mobile' in data:
        return 1
    else:
        return 0

def channels_social(data):
    if 'social' in data:
        return 1
    else:
        return 0

def channels_web(data):
    if 'web' in data:
        return 1
    else:
        return 0
portfolio_for_ml['email'] = portfolio_for_ml.channels.apply(lambda x: channels_email(x))
portfolio_for_ml['mobile'] = portfolio_for_ml.channels.apply(lambda x: channels_mobile(x))
portfolio_for_ml['social'] = portfolio_for_ml.channels.apply(lambda x: channels_social(x))
portfolio_for_ml['web'] = portfolio_for_ml.channels.apply(lambda x: channels_web(x))
portfolio_for_ml.drop(columns=['channels'],inplace=True)
Test
portfolio_for_ml.head(10)
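As an aside, the four channel helper functions can be collapsed into a single vectorized call; a sketch on a hypothetical frame, assuming each channels entry is a list of strings:

```python
import pandas as pd

# Hypothetical frame mirroring the portfolio "channels" column.
demo = pd.DataFrame({'channels': [['email', 'mobile', 'web'],
                                  ['email', 'social']]})

# Join each list into a '|'-separated string, then one-hot encode it.
channel_dummies = demo['channels'].str.join('|').str.get_dummies()
print(channel_dummies)
```

The columns come out alphabetically sorted (email, mobile, social, web), so the result can be concatenated straight onto the original frame.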
3. Instead of using the raw id (offer ID) string, we will map each offer to a numeric offer_code value
Code
portfolio_for_ml['offer_code'] = (portfolio_for_ml.index.values+1)
Test
portfolio_for_ml.head(10)
We will consolidate all the cleaning steps into one single function.
Code
def generate_portfolio_for_ml(portfolio = portfolio.copy()):
    portfolio_for_ml = pd.get_dummies(portfolio,columns=['offer_type'])
    portfolio_for_ml.rename(columns={'offer_type_bogo':'bogo',
                                     'offer_type_discount':'discount',
                                     'offer_type_informational':'informational'},
                            inplace=True)
    portfolio_for_ml['email'] = portfolio_for_ml.channels.apply(lambda x: channels_email(x))
    portfolio_for_ml['mobile'] = portfolio_for_ml.channels.apply(lambda x: channels_mobile(x))
    portfolio_for_ml['social'] = portfolio_for_ml.channels.apply(lambda x: channels_social(x))
    portfolio_for_ml['web'] = portfolio_for_ml.channels.apply(lambda x: channels_web(x))
    portfolio_for_ml.drop(columns=['channels'],inplace=True)
    portfolio_for_ml['offer_code'] = (portfolio_for_ml.index.values+1)
    return portfolio_for_ml
Test
portfolio_for_ml_1 = generate_portfolio_for_ml()
portfolio_for_ml.equals(portfolio_for_ml_1)
del portfolio_for_ml_1
gc.collect()
portfolio_for_ml.to_csv('portfolio_for_ml.csv',index=False)
profile.info()
profile.isnull().any()
def plot_age(profile = profile):
    trace = go.Histogram(x=profile.age.values,
                         name='Age',
                         marker=dict(color='rgba(95,158,209,1)'))
    layout = go.Layout(title = 'Age Distribution',
                       xaxis=dict(title='Age'))
    fig = go.Figure(data=[trace], layout=layout)
    iplot(fig)

def plot_gender(profile = profile):
    trace = go.Histogram(x=profile.gender.values,
                         name='Gender',
                         marker=dict(color='rgba(95,158,209,1)'))
    layout = go.Layout(title = 'Gender Distribution',
                       xaxis=dict(title='Gender'))
    fig = go.Figure(data=[trace], layout=layout)
    iplot(fig)

def plot_income(profile = profile):
    trace = go.Histogram(x=profile.income.values,
                         name='Income',
                         marker=dict(color='rgba(95,158,209,1)'))
    layout = go.Layout(title = 'Income Distribution',
                       xaxis=dict(title='Income'))
    fig = go.Figure(data=[trace], layout=layout)
    iplot(fig)

def plot_join_date(profile = profile):
    trace = go.Histogram(x=profile.became_member_on,
                         name='Became Member on',
                         marker=dict(color='rgba(95,158,209,1)'))
    layout = go.Layout(title = 'Membership Join Date Distribution',
                       xaxis=dict(title='Membership Join Date'))
    fig = go.Figure(data=[trace], layout=layout)
    iplot(fig)
plot_age()
plot_gender()
plot_income()
(profile.age==118).sum(),profile.gender.isnull().sum(),profile.income.isnull().sum()
profile[(profile.age==118) & (profile.gender.isnull()) & (profile.income.isnull())].shape
profile[(profile.age==118) & (profile.gender.isnull()) & (profile.income.isnull())].shape[0]/ profile.shape[0]
From the visual assessment, in the Profile Data set:
From the programmatic assessment, in the Profile Data set:
The following fixes will be implemented on the Profile data set in the cleaning phase:
The data is not in a machine-learning-friendly structure. We will create a new ML-friendly Pandas data frame with the following changes:
1. Drop rows with missing values, which should implicitly drop rows with age 118.
Code
profile.dropna(inplace=True)
profile.reset_index(inplace=True,drop=True)
Test
(profile.age==118).sum(),profile.gender.isnull().sum(),profile.income.isnull().sum()
profile[(profile.age==118) & (profile.gender.isnull()) & (profile.income.isnull())].shape
plot_age()
plot_gender()
plot_income()
2. Convert became_member_on to Pandas DateTime datatype.
Code
profile['became_member_on'] = pd.to_datetime(profile.became_member_on,format='%Y%m%d')
Test
profile.info()
profile.head()
plot_join_date()
3. Create machine learning friendly profile.
Code
profile_for_ml = profile.copy()
profile_for_ml = pd.get_dummies(profile_for_ml,columns=['gender'])
profile_for_ml['became_member_on_year'] = profile_for_ml.became_member_on.dt.year
profile_for_ml['became_member_on_month'] = profile_for_ml.became_member_on.dt.month
profile_for_ml['became_member_on_date'] = profile_for_ml.became_member_on.dt.day
profile_for_ml.drop(columns=['became_member_on'], inplace=True)
Test
profile_for_ml.head()
We will consolidate all the cleaning steps into one single function.
def clean_profile(profile = profile.copy()):
    profile.dropna(inplace=True)
    profile.reset_index(inplace=True,drop=True)
    profile['became_member_on'] = pd.to_datetime(profile.became_member_on,format='%Y%m%d')
    return profile

def generate_profile_for_ml(profile = clean_profile(profile= profile.copy())):
    profile_for_ml = profile.copy()
    profile_for_ml = pd.get_dummies(profile_for_ml,columns=['gender'])
    profile_for_ml['became_member_on_year'] = profile_for_ml.became_member_on.dt.year
    profile_for_ml['became_member_on_month'] = profile_for_ml.became_member_on.dt.month
    profile_for_ml['became_member_on_date'] = profile_for_ml.became_member_on.dt.day
    profile_for_ml.drop(columns=['became_member_on'], inplace=True)
    return profile_for_ml
Test
profile_for_ml_1 = generate_profile_for_ml(clean_profile())
profile_for_ml.equals(profile_for_ml_1)
profile_for_ml_1 = generate_profile_for_ml()
profile_for_ml.equals(profile_for_ml_1)
del profile_for_ml_1
gc.collect()
profile_for_ml.to_csv('profile_for_ml.csv',index=False)
transcript.info()
transcript.isnull().any()
transcript.head()
# Uncomment below code if you want to print the value_counts
#transcript.value.value_counts()
def plot_events(transcript = transcript):
    trace = go.Histogram(x=transcript.event.values,
                         name='Event',
                         marker=dict(color='rgba(95,158,209,1)'))
    layout = go.Layout(title = 'Event Distribution',
                       xaxis=dict(title='Events'))
    fig = go.Figure(data=[trace], layout=layout)
    iplot(fig)
plot_events()
From a visual and programmatic assessment, there are no data issues in the Transcript data set.
However, whether a promotion influenced the user is not recorded in the data. A user is deemed to be influenced by a promotion only if the individual made a transaction after viewing the advertisement.
In the cleaning step:
Clean
def get_offer_id(data):
    try:
        return data['offer id']
    except KeyError:
        try:
            return data['offer_id']
        except KeyError:
            return ''

def get_reward(data):
    try:
        return data['reward']
    except KeyError:
        return 0

def get_amount(data):
    try:
        return data['amount']
    except KeyError:
        return 0

def get_duration(offer_id):
    if offer_id.strip() != '':
        return portfolio[portfolio.id == offer_id]['duration'].values[0]
    else:
        return 0
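The try/except extractors above could also be written with dict.get; a compact sketch of the same fallback logic (the `_alt` helper names are hypothetical):

```python
def get_offer_id_alt(data):
    # 'offer id' (with a space) appears on received/viewed events,
    # while 'offer_id' (with an underscore) appears on completed events.
    return data.get('offer id', data.get('offer_id', ''))

def get_numeric_alt(data, key):
    # Generic counterpart of get_reward / get_amount: default to 0 when absent.
    return data.get(key, 0)

print(get_offer_id_alt({'offer id': 'ae264e'}))    # ae264e
print(get_numeric_alt({'amount': 9.5}, 'reward'))  # 0
```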
transcript_clone = transcript.copy()
1. Create dummy columns out of the event column
Code
transcript_clone =pd.get_dummies(transcript_clone,columns=['event'])
transcript_clone.rename(columns={'event_offer completed':'offer_completed',
'event_offer received':'offer_received',
'event_offer viewed':'offer_viewed',
'event_transaction':'transaction'},
inplace=True)
Test
transcript_clone.head()
2. The "value" column is a composite column that contains Offer ID, Reward and Amount information. We will extract the information into individual columns.
Code
transcript_clone['offer_id'] = transcript_clone.value.apply(get_offer_id)
transcript_clone['reward'] = transcript_clone.value.apply(get_reward)
transcript_clone['amount'] = transcript_clone.value.apply(get_amount)
transcript_clone.drop(columns=['value'],inplace=True)
transcript_clone = transcript_clone[['person','time','offer_id','offer_received','offer_viewed','offer_completed',
'transaction','reward','amount']]
Test
transcript_clone.head()
3. When an individual has utilized an offer, two transaction records are created: one for claiming the reward and another for making the purchase. We are going to consolidate these two records into one.
Looking at the output below, we can see that for person 0009655768c64bdeb2e877511632db8f at time 414 there are two records: one for the purchase and one for claiming the reward.
Once the data frame is cleaned, this should no longer be the case.
transcript_clone.sort_values(['person','time']).head(20)
Code
transcript_clean = transcript_clone.groupby(['person','time'],as_index=False).agg('max')
Test
transcript_clean.sort_values(['person','time']).head(20)
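The heavy lifting here is done by groupby(...).agg('max'); on a pair of hypothetical duplicate records it behaves like this:

```python
import pandas as pd

# Two records for one person at the same timestamp:
# a reward claim and the matching purchase.
demo = pd.DataFrame({'person': ['abc', 'abc'],
                     'time': [414, 414],
                     'offer_completed': [1, 0],
                     'transaction': [0, 1],
                     'amount': [0.0, 15.6]})

# Grouping on (person, time) and taking the column-wise max
# collapses the pair into one row carrying both flags and the amount.
merged = demo.groupby(['person', 'time'], as_index=False).agg('max')
print(merged)
```

Because each column's maximum is taken independently, the surviving row has offer_completed=1, transaction=1 and amount=15.6.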
4. Each offer is valid only for a certain number of days once received. In the current data frame, we do not have this information. For successful completion of the offer, the offer should be utilized before expiration. Note that duration is expressed in days while time is in hours, hence the conversion to hours before computing the expiration.
Code
transcript_clean['duration'] = transcript_clean[transcript_clean.offer_received == 1].offer_id.apply(get_duration)
transcript_clean.duration.fillna(0,inplace=True)
transcript_clean['duration'] = transcript_clean.duration.apply(lambda x:x*24)
transcript_clean['expiration'] = transcript_clean.time + transcript_clean.duration
transcript_clean.drop(columns='duration',inplace=True)
transcript_clean = transcript_clean[['person', 'time', 'expiration','offer_id', 'offer_received', 'offer_viewed',
'offer_completed', 'transaction', 'reward', 'amount']]
transcript_clean['expiration'] = transcript_clean.expiration.astype(int)
Test
transcript_clean.head(20)
From the above output, rows that are not "offer received" events have their expiration populated with the transaction timestamp. We need to fill in the correct offer expiration time wherever an offer id exists.
Code
idx = transcript_clean[transcript_clean.offer_received == 0].index
transcript_clean['expiration'].iloc[idx] = None
transcript_clean.expiration = transcript_clean.expiration.fillna(value=transcript_clean.time)
transcript_clean['expiration'] = transcript_clean.expiration.astype(int)
idx = transcript_clean[(transcript_clean.offer_id != '')
& (transcript_clean.offer_received == 0)].index
transcript_clean['expiration'].iloc[idx] = None
transcript_clean.expiration = transcript_clean.expiration.fillna(method = 'ffill')
transcript_clean['expiration'] = transcript_clean.expiration.astype(int)
Test
transcript_clean.head(20)
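The forward fill above propagates each offer's expiration from its "offer received" row down to the later view/complete rows; a toy illustration of the same propagation (using the equivalent .ffill()):

```python
import pandas as pd
import numpy as np

# Hypothetical expiration column: NaN on the view/complete rows that
# follow each 'offer received' row.
expiration = pd.Series([576.0, np.nan, np.nan, 240.0, np.nan])
print(expiration.ffill().tolist())  # [576.0, 576.0, 576.0, 240.0, 240.0]
```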
5. We will use the time column to create new columns: offer_received_time, offer_viewed_time and offer_completed_time.
Code
transcript_clean['offer_received_time']=transcript_clean[transcript_clean.offer_received == 1]['time']
transcript_clean['offer_viewed_time']=transcript_clean[transcript_clean.offer_viewed == 1]['time']
transcript_clean['offer_completed_time']=transcript_clean[transcript_clean.offer_completed == 1]['time']
transcript_clean.offer_received_time.fillna(0,inplace=True)
transcript_clean.offer_viewed_time.fillna(0,inplace=True)
transcript_clean.offer_completed_time.fillna(0,inplace=True)
Test
transcript_clean.head(20)
6. A person can receive the same offer multiple times. To consolidate the transaction records that fall within an offer's validity window, we will create a new column "offerid_expiration" and use this column to group the transactions.
Code
transcript_clean['offerid_expiration'] = ''
idx = transcript_clean[transcript_clean.offer_id != ''].index
transcript_clean['expiration'] = transcript_clean.expiration.astype(str)
transcript_clean['offerid_expiration'].iloc[idx] = transcript_clean['offer_id'].iloc[idx] + transcript_clean['expiration'].iloc[idx]
transcript_clean['expiration'] = transcript_clean.expiration.astype(int)
Test
transcript_clean.head(20)
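The composite key simply concatenates the offer id with the expiration timestamp, so two receipts of the same offer get distinct keys; a sketch on hypothetical rows:

```python
import pandas as pd

# Hypothetical rows: the same offer received twice by one person
# (expiring at different times), plus a plain transaction.
demo = pd.DataFrame({'offer_id': ['fafd', 'fafd', ''],
                     'expiration': [576, 912, 300]})

mask = demo.offer_id != ''
demo['offerid_expiration'] = ''
# Concatenate offer id and expiration only on offer-related rows.
demo.loc[mask, 'offerid_expiration'] = demo.offer_id + demo.expiration.astype(str)
print(demo.offerid_expiration.tolist())  # ['fafd576', 'fafd912', '']
```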
7. Consolidate the transaction records that fall within an offer's validity window
Code
transcript_time = transcript_clean.groupby(['person','offerid_expiration'], as_index=False)[['amount',
'offer_id',
'offer_received_time',
'offer_viewed_time',
'offer_completed_time']].max()
transcript_clean.drop(columns=['offer_received_time','offer_viewed_time','offer_completed_time'],
inplace=True)
transcript_clean =transcript_clean.merge(transcript_time,
left_on=['person','offerid_expiration'],
right_on = ['person','offerid_expiration'],
how= 'outer')
transcript_clean.fillna(0,inplace=True)
transcript_clean.drop(columns=['offerid_expiration','offer_id_y'],inplace=True)
transcript_clean.rename(columns={'offer_id_x':'offer_id'},inplace=True)
transcript_clean = transcript_clean.sort_values(by=['person','time'])
Test
transcript_clean.head(20)
8. We still have separate transaction records for viewing/completing. We will remove these rows, as we have already captured this information in the offer-received transaction.
Code
idx = transcript_clean[(transcript_clean.offer_id != '') & (transcript_clean.offer_received == 0)].index
transcript_clean.drop(labels=idx,inplace=True)
transcript_clean.reset_index(inplace=True,drop=True)
Test
transcript_clean.head(20)
9. When we consolidated the transactions, for purchases performed without a coupon, the "amount_y" column was populated with the maximum amount spent by the person. We need to correct this.
Code
transcript_clean['amount']= transcript_clean[transcript_clean.offer_id == '']['amount_x']
transcript_clean['amount']= transcript_clean.amount.fillna(value=transcript_clean.amount_y)
transcript_clean.drop(columns=['amount_x','amount_y'],inplace=True)
Test
transcript_clean.head(20)
10. For regular transactions, we still have the expiration column populated. We will fill the expiration with 0.
Code
idx = transcript_clean[transcript_clean.offer_id == ''].index
transcript_clean['expiration'].iloc[idx] = 0
Test
transcript_clean.head(20)
11. A user is deemed to be influenced by a promotion only if the individual made a transaction after viewing the advertisement. We will create a new column indicating whether the promotion influenced the individual.
Code
idx = transcript_clean[(transcript_clean.offer_viewed_time >0)
& (transcript_clean.offer_viewed_time > transcript_clean.offer_received_time)
& (transcript_clean.offer_completed_time > transcript_clean.offer_viewed_time)].index
transcript_clean['influenced'] = 0
transcript_clean['influenced'].iloc[idx] = 1
Test
transcript_clean.head(20)
12. Create a new column to capture transaction time.
Code
transcript_clean['offer_received_time'] = transcript_clean.offer_received_time.astype(int)
transcript_clean['offer_viewed_time'] = transcript_clean.offer_viewed_time.astype(int)
transcript_clean['offer_completed_time'] = transcript_clean.offer_completed_time.astype(int)
transcript_clean['transaction_time'] = 0
idx = transcript_clean[transcript_clean.transaction == 1].index
transcript_clean['transaction_time'].iloc[idx] = transcript_clean['time'].iloc[idx]
idx = transcript_clean[transcript_clean.transaction == 0].index
transcript_clean['transaction_time'].iloc[idx] = transcript_clean['offer_completed_time'].iloc[idx]
Test
transcript_clean.head(20)
13. When the transactions were consolidated, we lost the offer_received, offer_viewed and offer_completed flags. We need to repopulate them with correct values.
Code
transcript_clean['offer_received'] = 0
idx = transcript_clean[transcript_clean.offer_received_time > 0].index
transcript_clean['offer_received'].iloc[idx] = 1
transcript_clean['offer_viewed'] = 0
idx = transcript_clean[transcript_clean.offer_viewed_time > 0].index
transcript_clean['offer_viewed'].iloc[idx] = 1
transcript_clean['offer_completed'] = 0
idx = transcript_clean[transcript_clean.offer_completed_time > 0].index
transcript_clean['offer_completed'].iloc[idx] = 1
transcript_clean = transcript_clean[['person','offer_id', 'time','offer_received_time', 'offer_viewed_time',
'offer_completed_time','transaction_time','expiration','offer_received',
'offer_viewed','offer_completed', 'transaction', 'reward','amount',
'influenced']]
Test
transcript_clean.head(20)
14. We no longer need the "time" and "expiration" information. We will drop these columns.
Code
transcript_clean.drop(columns=['time','expiration'],inplace=True)
Test
transcript_clean.head(20)
We will consolidate all the cleaning steps into one single function.
def clean_transcript(transcript_clone = transcript.copy()):
    '''
    Create dummy columns out of the event column
    '''
    transcript_clone = pd.get_dummies(transcript_clone,columns=['event'])
    transcript_clone.rename(columns={'event_offer completed':'offer_completed',
                                     'event_offer received':'offer_received',
                                     'event_offer viewed':'offer_viewed',
                                     'event_transaction':'transaction'},
                            inplace=True)
    '''
    The "value" column is a composite column that contains Offer ID, Reward and Amount information.
    We will extract the information into individual columns.
    '''
    transcript_clone['offer_id'] = transcript_clone.value.apply(get_offer_id)
    transcript_clone['reward'] = transcript_clone.value.apply(get_reward)
    transcript_clone['amount'] = transcript_clone.value.apply(get_amount)
    transcript_clone.drop(columns=['value'],inplace=True)
    transcript_clone = transcript_clone[['person','time','offer_id','offer_received','offer_viewed','offer_completed',
                                         'transaction','reward','amount']]
    '''
    When an individual has utilized an offer, two transaction records are created: one for claiming the
    reward and another for making the purchase. We are going to consolidate these two records into one.
    '''
    transcript_clean = transcript_clone.groupby(['person','time'],as_index=False).agg('max')
    '''
    Each offer is valid only for a certain number of days once received. In the current data frame, we do not
    have this information. For successful completion of the offer, the offer should be utilized before expiration.
    '''
    transcript_clean['duration'] = transcript_clean[transcript_clean.offer_received == 1].offer_id.apply(get_duration)
    transcript_clean.duration.fillna(0,inplace=True)
    transcript_clean['duration'] = transcript_clean.duration.apply(lambda x: x*24)
    transcript_clean['expiration'] = transcript_clean.time + transcript_clean.duration
    transcript_clean.drop(columns='duration',inplace=True)
    transcript_clean = transcript_clean[['person', 'time', 'expiration','offer_id', 'offer_received', 'offer_viewed',
                                         'offer_completed', 'transaction', 'reward', 'amount']]
    transcript_clean['expiration'] = transcript_clean.expiration.astype(int)
    '''
    Rows that are not "offer received" events have their expiration populated with the transaction
    timestamp. We need to fill in the correct offer expiration time wherever an offer id exists.
    '''
    idx = transcript_clean[transcript_clean.offer_received == 0].index
    transcript_clean['expiration'].iloc[idx] = None
    transcript_clean.expiration = transcript_clean.expiration.fillna(value=transcript_clean.time)
    transcript_clean['expiration'] = transcript_clean.expiration.astype(int)
    idx = transcript_clean[(transcript_clean.offer_id != '')
                           & (transcript_clean.offer_received == 0)].index
    transcript_clean['expiration'].iloc[idx] = None
    transcript_clean.expiration = transcript_clean.expiration.fillna(method = 'ffill')
    transcript_clean['expiration'] = transcript_clean.expiration.astype(int)
    '''
    Use the time column to create new columns: offer_received_time, offer_viewed_time,
    offer_completed_time.
    '''
    transcript_clean['offer_received_time'] = transcript_clean[transcript_clean.offer_received == 1]['time']
    transcript_clean['offer_viewed_time'] = transcript_clean[transcript_clean.offer_viewed == 1]['time']
    transcript_clean['offer_completed_time'] = transcript_clean[transcript_clean.offer_completed == 1]['time']
    transcript_clean.offer_received_time.fillna(0,inplace=True)
    transcript_clean.offer_viewed_time.fillna(0,inplace=True)
    transcript_clean.offer_completed_time.fillna(0,inplace=True)
    '''
    A person can receive the same offer multiple times. To consolidate the transaction records that fall
    within an offer's validity window, we create a new column "offerid_expiration" and use this column
    to group the transactions.
    '''
    transcript_clean['offerid_expiration'] = ''
    idx = transcript_clean[transcript_clean.offer_id != ''].index
    transcript_clean['expiration'] = transcript_clean.expiration.astype(str)
    transcript_clean['offerid_expiration'].iloc[idx] = transcript_clean['offer_id'].iloc[idx] + transcript_clean['expiration'].iloc[idx]
    transcript_clean['expiration'] = transcript_clean.expiration.astype(int)
    '''
    Consolidate the transaction records that fall within an offer's validity window
    '''
    transcript_time = transcript_clean.groupby(['person','offerid_expiration'], as_index=False)[['amount',
                                                                                                 'offer_id',
                                                                                                 'offer_received_time',
                                                                                                 'offer_viewed_time',
                                                                                                 'offer_completed_time']].max()
    transcript_clean.drop(columns=['offer_received_time','offer_viewed_time','offer_completed_time'],
                          inplace=True)
    transcript_clean = transcript_clean.merge(transcript_time,
                                              left_on=['person','offerid_expiration'],
                                              right_on=['person','offerid_expiration'],
                                              how='outer')
    transcript_clean.fillna(0,inplace=True)
    transcript_clean = transcript_clean.sort_values(by=['person','time'])
    transcript_clean.drop(columns=['offerid_expiration','offer_id_y'],inplace=True)
    transcript_clean.rename(columns={'offer_id_x':'offer_id'},inplace=True)
    '''
    We still have separate transaction records for viewing/completing. We will remove these rows, as we
    have already captured this information in the offer-received transaction.
    '''
    idx = transcript_clean[(transcript_clean.offer_id != '') & (transcript_clean.offer_received == 0)].index
    transcript_clean.drop(labels=idx,inplace=True)
    transcript_clean.reset_index(inplace=True,drop=True)
    '''
    When we consolidated the transactions, for purchases performed without a coupon, the "amount_y" column
    was populated with the maximum amount spent by the person. We need to correct this.
    '''
    transcript_clean['amount'] = transcript_clean[transcript_clean.offer_id == '']['amount_x']
    transcript_clean['amount'] = transcript_clean.amount.fillna(value=transcript_clean.amount_y)
    transcript_clean.drop(columns=['amount_x','amount_y'],inplace=True)
    '''
    For regular transactions, the expiration column is still populated. We will fill the expiration with 0.
    '''
    idx = transcript_clean[transcript_clean.offer_id == ''].index
    transcript_clean['expiration'].iloc[idx] = 0
    '''
    A user is deemed to be influenced by a promotion only if the individual made a transaction after
    viewing the advertisement. We create a new column indicating whether the promotion influenced the
    individual.
    '''
    idx = transcript_clean[(transcript_clean.offer_viewed_time > 0)
                           & (transcript_clean.offer_viewed_time > transcript_clean.offer_received_time)
                           & (transcript_clean.offer_completed_time > transcript_clean.offer_viewed_time)].index
    transcript_clean['influenced'] = 0
    transcript_clean['influenced'].iloc[idx] = 1
    '''
    Create a new column to capture the transaction time.
    '''
    transcript_clean['offer_received_time'] = transcript_clean.offer_received_time.astype(int)
    transcript_clean['offer_viewed_time'] = transcript_clean.offer_viewed_time.astype(int)
    transcript_clean['offer_completed_time'] = transcript_clean.offer_completed_time.astype(int)
    transcript_clean['transaction_time'] = 0
    idx = transcript_clean[transcript_clean.transaction == 1].index
    transcript_clean['transaction_time'].iloc[idx] = transcript_clean['time'].iloc[idx]
    idx = transcript_clean[transcript_clean.transaction == 0].index
    transcript_clean['transaction_time'].iloc[idx] = transcript_clean['offer_completed_time'].iloc[idx]
    '''
    When the transactions were consolidated, we lost the offer_received, offer_viewed and offer_completed
    flags. We need to repopulate them with correct values.
    '''
    transcript_clean['offer_received'] = 0
    idx = transcript_clean[transcript_clean.offer_received_time > 0].index
    transcript_clean['offer_received'].iloc[idx] = 1
    transcript_clean['offer_viewed'] = 0
    idx = transcript_clean[transcript_clean.offer_viewed_time > 0].index
    transcript_clean['offer_viewed'].iloc[idx] = 1
    transcript_clean['offer_completed'] = 0
    idx = transcript_clean[transcript_clean.offer_completed_time > 0].index
    transcript_clean['offer_completed'].iloc[idx] = 1
    transcript_clean = transcript_clean[['person','offer_id', 'time','offer_received_time', 'offer_viewed_time',
                                         'offer_completed_time','transaction_time','expiration','offer_received',
                                         'offer_viewed','offer_completed', 'transaction', 'reward','amount',
                                         'influenced']]
    '''
    We no longer need the "time" and "expiration" information. We will drop these columns.
    '''
    transcript_clean.drop(columns=['time','expiration'],inplace=True)
    del transcript_clone
    del transcript_time
    gc.collect()
    return transcript_clean
#transcript_clean_1 = clean_transcript()
#transcript_clean.equals(transcript_clean_1)
del transcript_clone
del transcript_time
#del transcript_clean_1
gc.collect()
transcript_clean.to_csv('data/transcript_clean.csv',index=False)
Now that we have all three data frames cleaned, let's consolidate them into one data frame. Note that summing within person/offer groups can push the influenced flag above 1 (a person may have been influenced by the same offer more than once), so we cap it back to a binary value.
transaction = transcript_clean.groupby(['person','offer_id'],as_index=False).sum()
transaction.influenced.replace(to_replace=2, value=1,inplace=True)
transaction.influenced.replace(to_replace=3, value=1,inplace=True)
transaction.influenced.value_counts()
transaction.drop(columns=['offer_received_time','offer_viewed_time','offer_completed_time',
'transaction_time'], inplace = True)
transaction = transaction.merge(profile_for_ml,left_on='person',right_on='id')
transaction.drop(columns=['person','id'],inplace=True)
transaction = transaction.merge(portfolio_for_ml,left_on=['offer_id'],right_on=['id'], how= 'left')
transaction.drop(columns=['offer_received','offer_viewed','offer_completed','transaction',
'offer_id','id','reward_y'],inplace=True)
transaction.fillna(0,inplace=True)
transaction[['difficulty','duration', 'bogo', 'discount', 'informational', 'email', 'mobile',
'social', 'web', 'offer_code']] = transaction[['difficulty','duration', 'bogo', 'discount',
'informational', 'email', 'mobile','social', 'web',
'offer_code']].astype(int)
transaction.rename(columns={'reward_x':'reward'},inplace=True)
transaction = pd.get_dummies(transaction, columns=['offer_code'])
transaction = transaction[['age', 'income', 'gender_F','gender_M', 'gender_O', 'became_member_on_year',
'became_member_on_month','became_member_on_date','difficulty','duration',
'bogo', 'discount', 'informational', 'email', 'mobile','social', 'web','reward', 'amount',
'influenced','offer_code_0', 'offer_code_1', 'offer_code_2','offer_code_3', 'offer_code_4',
'offer_code_5', 'offer_code_6','offer_code_7', 'offer_code_8', 'offer_code_9',
'offer_code_10']]
We now consolidate all the cleaning steps into a single function.
def generate_transaction_without_dummies(transcript_clean, profile_for_ml, portfolio_for_ml):
    transaction = transcript_clean.groupby(['person','offer_id'], as_index=False).sum()
    transaction.influenced.replace(to_replace=2, value=1, inplace=True)
    transaction.influenced.replace(to_replace=3, value=1, inplace=True)
    transaction.drop(columns=['offer_received_time','offer_viewed_time','offer_completed_time',
                              'transaction_time'],
                     inplace=True)
    transaction = transaction.merge(profile_for_ml, left_on='person', right_on='id')
    transaction.drop(columns=['person','id'], inplace=True)
    transaction = transaction.merge(portfolio_for_ml, left_on=['offer_id'], right_on=['id'], how='left')
    transaction.drop(columns=['offer_received','offer_viewed','offer_completed','transaction','offer_id','id',
                              'reward_y'],
                     inplace=True)
    transaction.fillna(0, inplace=True)
    transaction[['difficulty','duration','bogo','discount','informational','email','mobile','social','web',
                 'offer_code']] = transaction[['difficulty','duration','bogo','discount','informational','email',
                                               'mobile','social','web','offer_code']].astype(int)
    transaction.rename(columns={'reward_x':'reward'}, inplace=True)
    return transaction
def generate_transaction(transcript_clean, profile_for_ml, portfolio_for_ml):
    transaction = generate_transaction_without_dummies(transcript_clean, profile_for_ml, portfolio_for_ml)
    transaction = pd.get_dummies(transaction, columns=['offer_code'])
    transaction = transaction[['age', 'income', 'gender_F', 'gender_M', 'gender_O', 'became_member_on_year',
                               'became_member_on_month', 'became_member_on_date', 'difficulty', 'duration',
                               'bogo', 'discount', 'informational', 'email', 'mobile', 'social', 'web',
                               'reward', 'amount', 'influenced',
                               'offer_code_0', 'offer_code_1', 'offer_code_2', 'offer_code_3', 'offer_code_4',
                               'offer_code_5', 'offer_code_6', 'offer_code_7', 'offer_code_8', 'offer_code_9',
                               'offer_code_10']]
    return transaction
transaction_1 = generate_transaction(transcript_clean,profile_for_ml,portfolio_for_ml)
transaction.equals(transaction_1)
del transaction_1
gc.collect()
Data analysis provides critical insights into the data and answers pertinent business questions. In this section, we examine multivariate frequency distributions of:
* events by gender
* events by age
* events by income
* events by offer type
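As a minimal illustration of this kind of frequency breakdown (toy data; column values are hypothetical, the real transcript has one row per person/offer event), a pandas groupby-sum tabulates event counts per category:

```python
import pandas as pd

# Toy event log standing in for the cleaned transcript.
events = pd.DataFrame({
    'gender': ['F', 'F', 'M', 'M', 'M', 'O'],
    'offer_viewed': [1, 0, 1, 1, 0, 1],
    'offer_completed': [1, 0, 0, 1, 0, 0],
})

# Frequency of each event per gender, the same pattern as the groupby-sum below.
by_gender = events.groupby('gender', as_index=False)[['offer_viewed', 'offer_completed']].sum()
print(by_gender)
```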
transcript_by_person = transcript_clean.groupby('person',as_index=False).sum()
transcript_by_person.drop(columns=['offer_received_time', 'offer_viewed_time','offer_completed_time',
'transaction_time'],
inplace = True)
transcript_by_person = profile.merge(transcript_by_person,left_on='id',right_on='person')
transcript_by_person.drop(columns=['id', 'person'],
inplace = True)
event_by_gender = transcript_by_person.groupby('gender',as_index=False)[['offer_completed','offer_received',
'offer_viewed','transaction',
'influenced']].sum()
event_by_gender.gender.replace(to_replace='F',value='Female',inplace=True)
event_by_gender.gender.replace(to_replace='M',value='Male',inplace=True)
event_by_gender.gender.replace(to_replace='O',value='Other',inplace=True)
trace0 = go.Bar(x=event_by_gender.gender,
y=event_by_gender.offer_received,
text=event_by_gender.offer_received,
textposition = 'auto',
name = 'Offer Received',
marker=dict(color='rgba(0,107,164,1)',))
trace1 = go.Bar(x=event_by_gender.gender,
y=event_by_gender.offer_viewed,
text=event_by_gender.offer_viewed,
textposition = 'auto',
name = 'Offer Viewed',
marker = dict(color ='rgba(255,128,14,1)',))
trace2 = go.Bar(x=event_by_gender.gender,
y=event_by_gender.offer_completed,
text=event_by_gender.offer_completed,
textposition = 'auto',
name = 'Offer Completed',
marker = dict(color ='rgba(171,171,171,1)',))
trace3 = go.Bar(x=event_by_gender.gender,
y=event_by_gender.transaction,
text=event_by_gender.transaction,
textposition = 'auto',
name = 'Transaction',
marker = dict(color ='rgba(89,89,89,1)',))
trace4 = go.Bar(x=event_by_gender.gender,
y=event_by_gender.influenced,
text=event_by_gender.influenced,
textposition = 'auto',
name = 'Influenced',
marker = dict(color ='rgba(95,158,209,1)',))
layout = go.Layout(
title = 'Event Distribution by Gender',
xaxis=dict(tickangle=-45),
barmode='group',
bargap=0.15
)
fig = go.Figure(data=[trace0,trace1,trace2,trace3,trace4],
                layout=layout)
iplot(fig)
del event_by_gender
gc.collect()
event_by_age = transcript_by_person.groupby('age',as_index=False)[['offer_completed','offer_received',
'offer_viewed','transaction',
'influenced']].sum()
trace0 = go.Scatter(x=event_by_age.age,
                    y=event_by_age.offer_received,
                    name = 'Offer Received',
                    mode = 'markers',
                    marker=dict(color='rgba(0,107,164,1)',))
trace1 = go.Scatter(x=event_by_age.age,
                    y=event_by_age.offer_viewed,
                    name = 'Offer Viewed',
                    mode = 'markers',
                    marker = dict(color ='rgba(255,128,14,1)',))
trace2 = go.Scatter(x=event_by_age.age,
                    y=event_by_age.offer_completed,
                    name = 'Offer Completed',
                    mode = 'markers',
                    marker = dict(color ='rgba(171,171,171,1)',))
trace3 = go.Scatter(x=event_by_age.age,
                    y=event_by_age.transaction,
                    name = 'Transaction',
                    mode = 'markers',
                    marker = dict(color ='rgba(89,89,89,1)',))
trace4 = go.Scatter(x=event_by_age.age,
                    y=event_by_age.influenced,
                    name = 'Influenced',
                    mode = 'markers',
                    marker = dict(color ='rgba(95,158,209,1)',))
layout = go.Layout(
    title = 'Event Distribution by Age'
)
fig = go.Figure(data=[trace0,trace1,trace2,trace3,trace4],
                layout=layout)
iplot(fig)
del event_by_age
gc.collect()
event_by_income = transcript_by_person.groupby('income',as_index=False)[['offer_completed','offer_received',
'offer_viewed','transaction',
'influenced']].sum()
trace0 = go.Scatter(x=event_by_income.income,
                    y=event_by_income.offer_received,
                    name = 'Offer Received',
                    marker=dict(color='rgba(0,107,164,1)',))
trace1 = go.Scatter(x=event_by_income.income,
                    y=event_by_income.offer_viewed,
                    name = 'Offer Viewed',
                    marker = dict(color ='rgba(255,128,14,1)',))
trace2 = go.Scatter(x=event_by_income.income,
                    y=event_by_income.offer_completed,
                    name = 'Offer Completed',
                    marker = dict(color ='rgba(200,82,0,1)',))
trace3 = go.Scatter(x=event_by_income.income,
                    y=event_by_income.transaction,
                    name = 'Transaction',
                    marker = dict(color ='rgba(89,89,89,1)',))
trace4 = go.Scatter(x=event_by_income.income,
                    y=event_by_income.influenced,
                    name = 'Influenced',
                    marker = dict(color ='rgba(95,158,209,1)',))
trace = go.Histogram(x=profile.income.values,
                     name='Income',
                     marker=dict(color='rgba(95,158,209,0.15)',),
                     yaxis='y2')
layout = go.Layout(
    title = 'Event Distribution by Income',
    yaxis2=dict(
        overlaying='y',
        side='right'
    )
)
fig = go.Figure(data=[trace0,trace1,trace2,trace3,trace4,trace],
                layout=layout)
iplot(fig)
del event_by_income
del transcript_by_person
gc.collect()
transcript_by_offer = transcript_clean.groupby('offer_id',as_index=False).sum()
transcript_by_offer = portfolio.merge(transcript_by_offer,left_on='id',right_on='offer_id')
transcript_by_offer.drop(columns=['channels', 'difficulty', 'duration','reward_x','offer_id', 'offer_received_time',
'offer_viewed_time','offer_completed_time', 'transaction_time','transaction',
'reward_y', 'amount'],
inplace=True)
transcript_by_offer = transcript_by_offer.groupby('offer_type',as_index=False).sum()
trace0 = go.Bar(x=transcript_by_offer.offer_type,
y=transcript_by_offer.offer_received,
text=transcript_by_offer.offer_received,
textposition = 'auto',
name = 'Offer Received',
marker=dict(color='rgba(0,107,164,1)',))
trace1 = go.Bar(x=transcript_by_offer.offer_type,
y=transcript_by_offer.offer_viewed,
text=transcript_by_offer.offer_viewed,
textposition = 'auto',
name = 'Offer Viewed',
marker = dict(color ='rgba(255,128,14,1)',))
trace2 = go.Bar(x=transcript_by_offer.offer_type,
y=transcript_by_offer.offer_completed,
text=transcript_by_offer.offer_completed,
textposition = 'auto',
name = 'Offer Completed',
marker = dict(color ='rgba(171,171,171,1)',))
trace3 = go.Bar(x=transcript_by_offer.offer_type,
y=transcript_by_offer.influenced,
text=transcript_by_offer.influenced,
textposition = 'auto',
name = 'Influenced',
marker = dict(color ='rgba(89,89,89,1)',))
layout = go.Layout(
title = 'Event Distribution by Offer Type',
xaxis=dict(tickangle=-45),
barmode='group',
bargap=0.15
)
fig = go.Figure(data=[trace0,trace1,trace2,trace3],
                layout=layout)
iplot(fig)
del transcript_by_offer
gc.collect()
Modeling techniques are selected and applied. Since some methods, such as neural nets, have specific requirements regarding the form of the data, there can be a loop back to the data preparation phase. Modeling is not a mandatory step and depends solely on the scope of the project. In this workbook, I am going to build three machine learning models:
* a model to predict whether an individual is influenced by an offer
* a model to predict the amount an individual will spend
* a model to predict the best offer for an individual
All three models are trained using ensemble methods.
For classification models, due to the imbalance in the target classes, we will use precision, recall, and F1 score as the evaluation metrics.
For regression models, we will use mean squared error and R2 as the evaluation metrics.
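To see why accuracy alone would mislead on an imbalanced target, here is a small sketch (toy labels, not from the data set) comparing accuracy with precision, recall, and F1:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Imbalanced toy labels: 8 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]  # one TP, one FP, one FN

print(accuracy_score(y_true, y_pred))   # 0.8 — looks strong despite weak minority-class performance
print(precision_score(y_true, y_pred))  # 0.5
print(recall_score(y_true, y_pred))     # 0.5
print(f1_score(y_true, y_pred))         # 0.5
```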
features = transaction.columns.drop(['influenced'])
X = transaction[features]
y = transaction['influenced']
numerical_columns = ['age', 'income','became_member_on_year', 'became_member_on_month','became_member_on_date',
'difficulty','duration','reward','amount']
features = transaction.columns.drop(['age', 'income', 'gender_F', 'gender_M', 'gender_O','became_member_on_year',
'became_member_on_month','became_member_on_date','duration', 'bogo', 'discount',
'informational', 'email', 'mobile', 'social', 'web','influenced','offer_code_0',
'offer_code_1', 'offer_code_2','offer_code_3', 'offer_code_4', 'offer_code_5',
'offer_code_6','offer_code_7', 'offer_code_8', 'offer_code_9', 'offer_code_10'])
X = transaction[features]
y = transaction['influenced']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
transformer = make_union(StandardScaler())
clf = RandomForestClassifier()
pipeline = Pipeline([
('transformer',transformer),
('classifier',clf)
])
parameters = [
{
"classifier__n_estimators": range(10,110,10)
},
{
"classifier": [AdaBoostClassifier()],
"classifier__n_estimators": range(10,110,10),
"classifier__learning_rate":np.linspace(0.1,2.5,20)
},
{
"classifier": [ExtraTreesClassifier()],
"classifier__n_estimators": range(10,110,10)
},
{
"classifier": [GradientBoostingClassifier()],
"classifier__n_estimators": range(10,110,10),
"classifier__learning_rate":np.linspace(0.1,2.5,20)
}
]
clf = AdaBoostClassifier()
pipeline = Pipeline([
('classifier',clf)
])
parameters = [
{
"classifier__n_estimators": [10],
"classifier__learning_rate":[1.8684210526315792]
}
]
scoring = make_scorer(f1_score)
# n_jobs=6 assumes an 8-core CPU; set n_jobs=-1 to use all available cores.
gridSearch = GridSearchCV(pipeline,
parameters,
verbose=2,
n_jobs = 6,
# n_jobs = -1,
cv = 5,
scoring=scoring,
return_train_score=True)
%%time
influence_clf = gridSearch.fit(X_train, y_train)
y_pred = influence_clf.predict(X_test)
y_train_pred = influence_clf.predict(X_train)
print(classification_report(y_true=y_train, y_pred=y_train_pred))
print(classification_report(y_true=y_test, y_pred=y_pred))
accuracy_score(y_true=y_train, y_pred=y_train_pred), accuracy_score(y_true=y_test, y_pred=y_pred)
f1_score(y_true=y_train, y_pred=y_train_pred), f1_score(y_true=y_test, y_pred=y_pred)
influence_clf.best_estimator_
# Pull the fitted classifier out of the best pipeline.
influence_clf_model = influence_clf.best_estimator_.named_steps['classifier']
influence_clf_model.feature_importances_
df = pd.DataFrame(list(zip(X.columns, influence_clf_model.feature_importances_)), columns=['Feature','Importance'])
trace0 = go.Bar(x=df.Feature,
y=df.Importance,
text=df.Importance,
textposition = 'auto',
name = 'Offer Received',
marker=dict(color='rgba(0,107,164,1)',))
layout = go.Layout(
title = 'Feature Importance for Model to predict Influence',
xaxis=dict(tickangle=-45),
barmode='group',
bargap=0.15
)
fig = go.Figure(data=[trace0],
                layout=layout)
iplot(fig)
## Uncomment if needed
#dump(influence_clf, 'influnce_clf.joblib')
features = transaction.columns.drop(['amount','influenced'])
X = transaction[features]
y = transaction['amount']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
transformer = make_union(StandardScaler())
clf = RandomForestRegressor()
pipeline = Pipeline([
('transformer',transformer),
('classifier',clf)
])
parameters = [
{
"classifier__n_estimators": range(10,110,10)
},
{
"classifier": [AdaBoostRegressor()],
"classifier__n_estimators": range(10,110,10),
"classifier__learning_rate":np.linspace(0.1,2.5,20)
},
{
"classifier": [GradientBoostingRegressor()],
"classifier__n_estimators": range(10,110,10),
"classifier__learning_rate":np.linspace(0.1,2.5,20)
}
]
clf = GradientBoostingRegressor()
pipeline = Pipeline([
('transformer',transformer),
('classifier',clf)
])
parameters = [
{
"classifier__n_estimators": range(90,130,10),
"classifier__learning_rate":[0.1]
}
]
scoring = make_scorer(r2_score)
gridSearch = GridSearchCV(pipeline,
parameters,
verbose=2,
n_jobs = 6,
# n_jobs = -1,
cv = 5,
scoring=scoring,
# refit='F1',
return_train_score=True)
%%time
amount_clf = gridSearch.fit(X_train,y_train)
y_pred = amount_clf.predict(X_test)
y_train_pred = amount_clf.predict(X_train)
r2_score(y_true=y_test,y_pred=y_pred), r2_score(y_true=y_train,y_pred=y_train_pred)
mean_squared_error(y_true=y_test,y_pred=y_pred),mean_squared_error(y_true=y_train,y_pred=y_train_pred)
amount_clf
amount_clf.best_estimator_
amount_clf_model = amount_clf.best_estimator_.named_steps['classifier']
amount_clf_model.feature_importances_
df = pd.DataFrame(list(zip(X.columns,amount_clf_model.feature_importances_)),columns=['Feature','Importance'])
trace0 = go.Bar(x=df.Feature,
y=df.Importance,
name = 'Offer Received',
marker=dict(color='rgba(0,107,164,1)',))
layout = go.Layout(
title = 'Feature Importance for Model to predict Amount',
xaxis=dict(tickangle=-45),
barmode='group',
bargap=0.15
)
fig = go.Figure(data=[trace0],
                layout=layout)
iplot(fig)
##Uncomment if needed
#dump(amount_clf, 'amount_clf.joblib')
transaction_for_offer = generate_transaction_without_dummies(transcript_clean, profile_for_ml, portfolio_for_ml)
features = transaction_for_offer.columns.drop(['difficulty', 'duration', 'bogo', 'discount','informational', 'email',
'mobile','social', 'web', 'reward','offer_code'])
X = transaction_for_offer[features]
y = transaction_for_offer['offer_code']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = RandomForestClassifier()
pipeline = Pipeline([
('classifier',clf)
])
parameters = [
{
"classifier__n_estimators": range(10,110,10)
},
{
"classifier": [AdaBoostClassifier()],
"classifier__n_estimators": range(10,110,10)
},
{
"classifier": [ExtraTreesClassifier()],
"classifier__n_estimators": range(10,110,10)
},
{
"classifier": [GradientBoostingRegressor()],
"classifier__n_estimators": range(10,110,10),
"classifier__learning_rate":np.linspace(0.1,2.5,20)
}
]
clf = AdaBoostClassifier()
pipeline = Pipeline([
('classifier',clf)
])
parameters = [
{
"classifier": [AdaBoostClassifier()],
"classifier__n_estimators": [10]
}
]
#scoring = make_scorer(f1_score)
# n_jobs=6 assumes an 8-core CPU; set n_jobs=-1 to use all available cores.
gridSearch = GridSearchCV(pipeline,
parameters,
verbose=2,
n_jobs = 6,
# n_jobs = -1,
cv = 5,
# scoring=scoring,
return_train_score=True)
%%time
offer_code_clf = gridSearch.fit(X_train,y_train)
y_pred = offer_code_clf.predict(X_test)
y_train_pred = offer_code_clf.predict(X_train)
print(classification_report(y_true=y_train,y_pred= y_train_pred))
print(classification_report(y_true=y_test,y_pred= y_pred))
accuracy_score(y_true=y_train,y_pred= y_train_pred), accuracy_score(y_true=y_test,y_pred= y_pred)
offer_code_clf.best_estimator_
offer_code_clf_model = offer_code_clf.best_estimator_.named_steps['classifier']
offer_code_clf_model.feature_importances_
df = pd.DataFrame(list(zip(X.columns,offer_code_clf_model.feature_importances_)),columns=['Feature','Importance'])
trace0 = go.Bar(x=df.Feature,
y=df.Importance,
name = 'Offer Received',
marker=dict(color='rgba(0,107,164,1)',))
layout = go.Layout(
title = 'Feature Importance for Model to predict Offer Code',
xaxis=dict(tickangle=-45),
barmode='group',
bargap=0.15
)
fig = go.Figure(data=[trace0],
                layout=layout)
iplot(fig)
#Uncomment if needed
#dump(offer_code_clf, 'offer_code_clf.joblib')
Once one or more models appear to be of high quality based on the loss functions, they need to be tested to ensure they generalize to unseen data and that all critical business issues are sufficiently addressed. The result is the selection of the champion model(s).
When we first started working on the data, we came up with three potential models. After training them, only one proved useful. There are a few takeaways:
* The model that predicts whether an individual was influenced by a promotion depends heavily on the amount spent, information that only becomes available after the purchase, so it is not useful for prediction.
* The model that predicts the best offer for an individual yields very low scores on the test data but decent scores on the training data (i.e., it tends toward overfitting / high variance), so it is discarded.
* The model that predicts the purchasing habits of individuals yields similar scores on training and test data, achieving a good trade-off between bias and variance. We will use this model in the web application to make predictions.
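The overfitting diagnosis (a large train/test gap) can be reproduced in miniature: an unconstrained decision tree memorizes its training data, while a depth-limited one scores similarly on both splits. A sketch on synthetic data (not the Starbucks data):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Noisy one-dimensional regression problem.
rng = np.random.RandomState(42)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=400)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scores = {}
for depth in (None, 3):  # None = fully grown tree that can memorize the noise
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42).fit(X_train, y_train)
    scores[depth] = (r2_score(y_train, tree.predict(X_train)),
                     r2_score(y_test, tree.predict(X_test)))
    print(depth, scores[depth])  # (train R2, test R2)
```

The fully grown tree scores a perfect train R2 with a much lower test R2; the depth-3 tree shows a far smaller gap.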
Generally, this will mean deploying a code representation of the model into an operating system to score or categorize new unseen data as it arises and to create a mechanism for the use of that new information in the solution of the original business problem. Importantly, the code representation must also include all the data prep steps leading up to modeling so that the model will treat new raw data in the same manner as during model development.
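One common way to satisfy this, and the pattern this notebook already uses, is to bundle the preprocessing and the model into a single sklearn Pipeline and persist the whole object, so new raw data goes through identical preparation at serving time. A minimal sketch (toy data; the estimator choice is illustrative):

```python
import io
import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Toy features standing in for the prepared transaction data.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Fitting the scaler inside the pipeline means new raw data is
# standardized with the training-time statistics when predicting.
pipe = Pipeline([('scaler', StandardScaler()),
                 ('classifier', LogisticRegression())]).fit(X, y)

# Persist and restore the whole pipeline (in-memory buffer for illustration;
# dump(model, 'model.joblib') writes a file instead).
buf = io.BytesIO()
joblib.dump(pipe, buf)
buf.seek(0)
restored = joblib.load(buf)
print((restored.predict(X) == pipe.predict(X)).all())  # True
```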
I have created a web application that utilizes the data analysis and the pretrained models. A complete description of the steps required to launch the web application can be found in the README.md file.
Screenshots from the web application are shown below.
Path: http://0.0.0.0:3001/

Path: http://0.0.0.0:3001/predict_amt

If you are using the workspace, you will need to go to the terminal and run the command conda update pandas before reading in the files. This is because the version of pandas in the workspace cannot read in the transcript.json file correctly, but the newest version of pandas can. You can access the terminal from the orange icon in the top left of this notebook.
You can see how to access the terminal and how the install works using the two images below. First you need to access the terminal:

Then you will want to run the above command:

Finally, when you enter back into the notebook (use the jupyter icon again), you should be able to run the below cell without any errors.